Despite being a paradigm of quantitative linguistics, Zipf's law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. The current support for Zipf's law in texts can therefore be summarized as anecdotal. We try to solve these issues by studying three different versions of Zipf's law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30000 texts). To do so, we use state-of-the-art fitting tools and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf's law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value) and with only one free parameter (the exponent).
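To make the third formulation concrete, the following is a minimal Python sketch of a pure power law in the complementary cumulative distribution function, fitted with a single exponent. It is an illustration under stated assumptions and not the fitting procedure used in this work: the function names, the regular-expression tokenizer, and the golden-section search are all hypothetical choices. It further assumes that the frequencies n start at 1 and are unbounded above, so that S(n) = P(N >= n) = n^(-beta) and the probability mass function is f(n) = n^(-beta) - (n+1)^(-beta). The search also assumes that the log-likelihood has a single maximum in the bracketing interval. No goodness-of-fit test is included.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return the absolute frequency n of each distinct word (type).

    The tokenizer is a simplifying assumption: lowercase runs of letters
    and apostrophes count as words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return list(Counter(words).values())


def empirical_ccdf(freqs):
    """Map each observed frequency n to the fraction of types with frequency >= n."""
    total = len(freqs)
    counts = Counter(freqs)
    ccdf = {}
    remaining = total
    for n in sorted(counts):
        ccdf[n] = remaining / total
        remaining -= counts[n]
    return ccdf


def log_likelihood(beta, counts):
    """Log-likelihood of beta under f(n) = n^-beta - (n+1)^-beta, n = 1, 2, ...

    The mass is rewritten as n^-beta * (1 - (1 + 1/n)^-beta) and evaluated
    with expm1/log1p to avoid cancellation at large n.
    """
    total = 0.0
    for n, c in counts.items():
        log_f = -beta * math.log(n) + math.log(-math.expm1(-beta * math.log1p(1.0 / n)))
        total += c * log_f
    return total


def fit_beta(freqs, lo=0.01, hi=10.0, tol=1e-6):
    """Estimate beta by maximizing the log-likelihood with a golden-section search.

    The bracket [lo, hi] and the unimodality of the likelihood are assumptions.
    """
    counts = Counter(freqs)
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, counts) > log_likelihood(d, counts):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2
```

For a text held in a string `text`, `fit_beta(word_frequencies(text))` returns the exponent estimate, and `empirical_ccdf` gives the curve against which n^(-beta) can be compared on log-log axes. Deciding whether the fit is acceptable at a given significance level requires a separate goodness-of-fit test, which this sketch omits.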